{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "# Non-linear feature engineering for Logistic Regression\n",
    "\n",
    "In the slides at the beginning of the module we mentioned that linear\n",
    "classification models are not suited to non-linearly separable data.\n",
    "Nevertheless, one can still use feature engineering as previously done for\n",
    "regression models to overcome this issue. To do so, we use non-linear\n",
    "transformations that typically map the original feature space into a higher\n",
    "dimension space, where the linear model can separate the data more easily.\n",
    "\n",
    "Let us illustrate this on three synthetic datasets. Each dataset has two\n",
    "original features and two classes to make it easy to visualize. The first\n",
    "dataset is called the \"moons\" dataset as the data points from each class are\n",
    "shaped as a crescent moon:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.datasets import make_moons\n",
    "\n",
    "feature_names = [\"Feature #0\", \"Feature #1\"]\n",
    "target_name = \"class\"\n",
    "\n",
    "X, y = make_moons(n_samples=100, noise=0.13, random_state=42)\n",
    "\n",
    "# We store both the data and target in a dataframe to ease plotting\n",
    "moons = pd.DataFrame(\n",
    "    np.concatenate([X, y[:, np.newaxis]], axis=1),\n",
    "    columns=feature_names + [target_name],\n",
    ")\n",
    "data_moons, target_moons = moons[feature_names], moons[target_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The second dataset is called the \"Gaussian quantiles\" dataset as all data\n",
    "points are sampled from a 2D Gaussian distribution regardless of the class.\n",
    "The points closest to the center are assigned to the class 1 while the points\n",
    "in the outer edges are assigned to the class 0, resulting in concentric\n",
    "circles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import make_gaussian_quantiles\n",
    "\n",
    "X, y = make_gaussian_quantiles(\n",
    "    n_samples=100, n_features=2, n_classes=2, random_state=42\n",
    ")\n",
    "gauss = pd.DataFrame(\n",
    "    np.concatenate([X, y[:, np.newaxis]], axis=1),\n",
    "    columns=feature_names + [target_name],\n",
    ")\n",
    "data_gauss, target_gauss = gauss[feature_names], gauss[target_name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The third dataset is called the \"XOR\" dataset as the data points are sampled\n",
    "from a uniform distribution in a 2D space and the class is defined by the\n",
    "Exclusive OR (XOR) operation on the two features: the target class is 1 if\n",
    "only one of the two features is greater than 0. The target class is 0\n",
    "otherwise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xor = pd.DataFrame(\n",
    "    np.random.RandomState(0).uniform(low=-1, high=1, size=(200, 2)),\n",
    "    columns=feature_names,\n",
    ")\n",
    "target_xor = np.logical_xor(xor[\"Feature #0\"] > 0, xor[\"Feature #1\"] > 0)\n",
    "target_xor = target_xor.astype(np.int32)\n",
    "xor[\"class\"] = target_xor\n",
    "data_xor = xor[feature_names]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "We use matplotlib to visualize all the datasets at a glance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 2
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.colors import ListedColormap\n",
    "\n",
    "\n",
    "_, axs = plt.subplots(ncols=3, figsize=(14, 4), constrained_layout=True)\n",
    "\n",
    "common_scatter_plot_params = dict(\n",
    "    cmap=ListedColormap([\"tab:red\", \"tab:blue\"]),\n",
    "    edgecolor=\"white\",\n",
    "    linewidth=1,\n",
    ")\n",
    "\n",
    "axs[0].scatter(\n",
    "    data_moons[feature_names[0]],\n",
    "    data_moons[feature_names[1]],\n",
    "    c=target_moons,\n",
    "    **common_scatter_plot_params,\n",
    ")\n",
    "axs[1].scatter(\n",
    "    data_gauss[feature_names[0]],\n",
    "    data_gauss[feature_names[1]],\n",
    "    c=target_gauss,\n",
    "    **common_scatter_plot_params,\n",
    ")\n",
    "axs[2].scatter(\n",
    "    data_xor[feature_names[0]],\n",
    "    data_xor[feature_names[1]],\n",
    "    c=target_xor,\n",
    "    **common_scatter_plot_params,\n",
    ")\n",
    "axs[0].set(\n",
    "    title=\"The moons dataset\",\n",
    "    xlabel=feature_names[0],\n",
    "    ylabel=feature_names[1],\n",
    ")\n",
    "axs[1].set(\n",
    "    title=\"The Gaussian quantiles dataset\",\n",
    "    xlabel=feature_names[0],\n",
    ")\n",
    "axs[2].set(\n",
    "    title=\"The XOR dataset\",\n",
    "    xlabel=feature_names[0],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "We intuitively observe that there is no (single) straight line that can\n",
    "separate the two classes in any of the datasets. We can confirm this by\n",
    "fitting a linear model, such as a logistic regression, to each dataset and\n",
    "plot the decision boundary of the model.\n",
    "\n",
    "Let's first define a function to help us fit a given model and plot its\n",
    "decision boundary on the previous datasets at a glance:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.inspection import DecisionBoundaryDisplay\n",
    "\n",
    "\n",
    "def plot_decision_boundary(model, title=None):\n",
    "    datasets = [\n",
    "        (data_moons, target_moons),\n",
    "        (data_gauss, target_gauss),\n",
    "        (data_xor, target_xor),\n",
    "    ]\n",
    "    fig, axs = plt.subplots(\n",
    "        ncols=3,\n",
    "        figsize=(14, 4),\n",
    "        constrained_layout=True,\n",
    "    )\n",
    "\n",
    "    for i, ax, (data, target) in zip(\n",
    "        range(len(datasets)),\n",
    "        axs,\n",
    "        datasets,\n",
    "    ):\n",
    "        model.fit(data, target)\n",
    "        DecisionBoundaryDisplay.from_estimator(\n",
    "            model,\n",
    "            data,\n",
    "            response_method=\"predict_proba\",\n",
    "            plot_method=\"pcolormesh\",\n",
    "            cmap=\"RdBu\",\n",
    "            alpha=0.8,\n",
    "            # Setting vmin and vmax to the extreme values of the probability to\n",
    "            # ensure that 0.5 is mapped to white (the middle) of the blue-red\n",
    "            # colormap.\n",
    "            vmin=0,\n",
    "            vmax=1,\n",
    "            ax=ax,\n",
    "        )\n",
    "        DecisionBoundaryDisplay.from_estimator(\n",
    "            model,\n",
    "            data,\n",
    "            response_method=\"predict_proba\",\n",
    "            plot_method=\"contour\",\n",
    "            alpha=0.8,\n",
    "            levels=[0.5],  # 0.5 probability contour line\n",
    "            linestyles=\"--\",\n",
    "            linewidths=2,\n",
    "            ax=ax,\n",
    "        )\n",
    "        ax.scatter(\n",
    "            data[feature_names[0]],\n",
    "            data[feature_names[1]],\n",
    "            c=target,\n",
    "            **common_scatter_plot_params,\n",
    "        )\n",
    "        if i > 0:\n",
    "            ax.set_ylabel(None)\n",
    "    if title is not None:\n",
    "        fig.suptitle(title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Now let's define our logistic regression model and plot its decision boundary\n",
    "on the three datasets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())\n",
    "logistic_regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_decision_boundary(logistic_regression, title=\"Linear classifier\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "This confirms that it is not possible to separate the two classes with a\n",
    "linear model. On each plot we see a **significant number of misclassified\n",
    "samples on the training set**! The three plots show typical cases of\n",
    "**underfitting** for linear models.\n",
    "\n",
    "Also, the last two plots show soft colors, meaning that the model is highly\n",
    "unsure about which class to choose."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Engineering non-linear features\n",
    "\n",
    "As we did for the linear regression models, we now attempt to build a more\n",
    "expressive machine learning pipeline by leveraging non-linear feature\n",
    "engineering, with techniques such as binning, splines, polynomial features,\n",
    "and kernel approximation.\n",
    "\n",
    "Let's start with the binning transformation of the features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import KBinsDiscretizer\n",
    "\n",
    "classifier = make_pipeline(\n",
    "    KBinsDiscretizer(n_bins=5, encode=\"onehot\"),  # already the default params\n",
    "    LogisticRegression(),\n",
    ")\n",
    "classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_decision_boundary(classifier, title=\"Binning classifier\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "We can see that the resulting decision boundary is constrained to follow\n",
    "**axis-aligned segments**, which is very similar to what a decision tree would\n",
    "do as we will see in the next Module. Furthermore, as for decision trees, the\n",
    "model makes piecewise constant predictions within each rectangular region.\n",
    "\n",
    "This axis-aligned decision boundary is not necessarily the natural decision\n",
    "boundary a human would have intuitively drawn for the moons dataset and the\n",
    "Gaussian quantiles datasets. It still makes it possible for the model to\n",
    "successfully separate the data. However, binning alone does not help the\n",
    "classifier separate the data for the XOR dataset. This is because **the\n",
    "binning transformation is a feature-wise transformation** and thus **cannot\n",
    "capture interactions** between features that are necessary to separate the\n",
    "XOR dataset.\n",
    "\n",
    "Let's now consider a **spline** transformation of the original features. This\n",
    "transformation can be considered a **smooth version of the binning\n",
    "transformation**. You can find more details in the [scikit-learn user guide](\n",
    "https://scikit-learn.org/stable/modules/preprocessing.html#spline-transformer)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import SplineTransformer\n",
    "\n",
    "classifier = make_pipeline(\n",
    "    SplineTransformer(degree=3, n_knots=5),\n",
    "    LogisticRegression(),\n",
    ")\n",
    "classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_decision_boundary(classifier, title=\"Spline classifier\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "We can see that the decision boundary is now smooth, and while it favors\n",
    "axis-aligned decision rules when extrapolating in low density regions, it can\n",
    "adopt a more curvy decision boundary in the high density regions.\n",
    "However, as for the binning transformation, the model still fails to separate\n",
    "the data for the XOR dataset, irrespective of the number of knots, for the\n",
    "same reasons: **the spline transformation is a feature-wise transformation**\n",
    "and thus **cannot capture interactions** between features.\n",
    "\n",
    "Take into account that the number of knots is a hyperparameter that needs to be\n",
    "tuned. If we use too few knots, the model would underfit the data, as shown on\n",
    "the moons dataset. If we use too many knots, the model would overfit the data.\n",
    "\n",
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">Notice that <tt class=\"docutils literal\"><span class=\"pre\">KBinsDiscretizer(encode=\"onehot\")</span></tt> and <tt class=\"docutils literal\">SplineTransformer</tt> do not\n",
    "require additional scaling. Indeed, they can replace the scaling step for\n",
    "numerical features: they both create features with values in the [0, 1] range.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Modeling non-additive feature interactions\n",
    "\n",
    "We now consider feature engineering techniques that non-linearly combine the\n",
    "original features in the hope of capturing interactions between them. We will\n",
    "consider polynomial features and kernel approximation.\n",
    "\n",
    "Let's start with the polynomial features:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import PolynomialFeatures\n",
    "\n",
    "classifier = make_pipeline(\n",
    "    StandardScaler(),\n",
    "    PolynomialFeatures(degree=3, include_bias=False),\n",
    "    LogisticRegression(C=10),\n",
    ")\n",
    "classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_decision_boundary(classifier, title=\"Polynomial classifier\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "We can see that the decision boundary of this polynomial classifier is\n",
    "**smooth** and can successfully separate the data on all three datasets\n",
    "(depending on how we set the values of the `degree` and `C`\n",
    "hyperparameters).\n",
    "\n",
    "It is interesting to observe that this models extrapolates very differently\n",
    "from the previous models: its decision boundary can take a diagonal\n",
    "direction. Furthermore, we can observe that predictions are very confident in\n",
    "the low density regions of the feature space, even very close to the decision\n",
    "boundary.\n",
    "\n",
    "We can obtain very similar results by using a kernel approximation technique\n",
    "such as the Nystr\u00f6m method with a polynomial kernel:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [],
   "source": [
    "from sklearn.kernel_approximation import Nystroem\n",
    "\n",
    "classifier = make_pipeline(\n",
    "    StandardScaler(),\n",
    "    Nystroem(kernel=\"poly\", degree=3, coef0=1, n_components=100),\n",
    "    LogisticRegression(C=10),\n",
    ")\n",
    "classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_decision_boundary(classifier, title=\"Polynomial Nystroem classifier\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The polynomial kernel approach would be interesting in cases where the\n",
    "original feature space is already of high dimension: in these cases,\n",
    "**computing the complete polynomial expansion** with `PolynomialFeatures`\n",
    "could be **intractable**, while the Nystr\u00f6m method can control the output\n",
    "dimensionality with the `n_components` parameter.\n",
    "\n",
    "Let's now explore the use of a radial basis function (RBF) kernel:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [],
   "source": [
    "from sklearn.kernel_approximation import Nystroem\n",
    "\n",
    "classifier = make_pipeline(\n",
    "    StandardScaler(),\n",
    "    Nystroem(kernel=\"rbf\", gamma=1, n_components=100),\n",
    "    LogisticRegression(C=5),\n",
    ")\n",
    "classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_decision_boundary(classifier, title=\"RBF Nystroem classifier\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The resulting decision boundary is **smooth** and can successfully separate\n",
    "the classes for all three datasets. Furthemore, the model extrapolates very\n",
    "differently: in particular, it tends to be **much less confident in its\n",
    "predictions in the low density regions** of the feature space.\n",
    "\n",
    "As for the previous polynomial pipelines, this pipeline **does not favor\n",
    "axis-aligned decision rules**. It can be shown mathematically that the\n",
    "[inductive bias](https://en.wikipedia.org/wiki/Inductive_bias) of our RBF\n",
    "pipeline is actually rotationally invariant."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Multi-step feature engineering\n",
    "\n",
    "It is possible to combine several feature engineering transformers in a\n",
    "single pipeline to blend their respective inductive biases. For instance, we\n",
    "can combine the binning transformation with a kernel approximation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [],
   "source": [
    "classifier = make_pipeline(\n",
    "    KBinsDiscretizer(n_bins=5),\n",
    "    Nystroem(kernel=\"rbf\", gamma=1.0, n_components=100),\n",
    "    LogisticRegression(),\n",
    ")\n",
    "classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_decision_boundary(classifier, title=\"Binning + Nystroem classifier\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "It is interesting to observe that this model is still piecewise constant with\n",
    "axis-aligned decision boundaries everywhere, but it can now successfully deal\n",
    "with the XOR problem thanks to the second step of the pipeline that can\n",
    "model the interactions between the features transformed by the first step.\n",
    "\n",
    "We can also combine the spline transformation with a kernel approximation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.kernel_approximation import Nystroem\n",
    "\n",
    "classifier = make_pipeline(\n",
    "    SplineTransformer(n_knots=5),\n",
    "    Nystroem(kernel=\"rbf\", gamma=1.0, n_components=100),\n",
    "    LogisticRegression(),\n",
    ")\n",
    "classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_decision_boundary(classifier, title=\"Spline + RBF Nystroem classifier\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "The decision boundary of this pipeline is smooth, but with axis-aligned\n",
    "extrapolation.\n",
    "\n",
    "Depending on the task, this can be considered an advantage or a drawback."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Summary and take-away messages\n",
    "\n",
    "- Linear models such as logistic regression can be used for classification on\n",
    "  non-linearly separable datasets by leveraging non-linear feature\n",
    "  engineering.\n",
    "- Transformers such as `KBinsDiscretizer` and `SplineTransformer` can be used\n",
    "  to engineer non-linear features independently for each original feature.\n",
    "- As a result, these transformers cannot capture interactions between the\n",
    "  original features (and then would fail on the XOR classification task).\n",
    "- Despite this limitation they already augment the expressivity of the\n",
    "  pipeline, which can be sufficient for some datasets.\n",
    "- They also favor axis-aligned decision boundaries, in particular in the low\n",
    "  density regions of the feature space (axis-aligned extrapolation).\n",
    "- Transformers such as `PolynomialFeatures` and `Nystroem` can be used to\n",
    "  engineer non-linear features that capture interactions between the original\n",
    "  features.\n",
    "- It can be useful to combine several feature engineering transformers in a\n",
    "  single pipeline to build a more expressive model, for instance to favor\n",
    "  axis-aligned extrapolation while also capturing interactions.\n",
    "- In particular, if the original dataset has both numerical and categorical\n",
    "  features, it can be useful to apply binning or a spline transformation to the\n",
    "  numerical features and one-hot encoding to the categorical features. Then,\n",
    "  the resulting features can be combined with a kernel approximation to model\n",
    "  interactions between numerical and categorical features. This can be\n",
    "  achieved with the help of `ColumnTransformer`.\n",
    "\n",
    "In subsequent notebooks and exercises, we will further explore the interplay\n",
    "between regularization, feature engineering, and the under-fitting /\n",
    "overfitting trade-off.\n",
    "\n",
    "But first we will do an exercise to illustrate the relationship between the\n",
    "Nystr\u00f6m kernel approximation and support vector machines."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}